Marginal Likelihood Estimation via Arrogance Sampling 



By Benedict Escoto 

Abstract 

This paper describes a method for estimating the marginal hkehhood or Bayes fac- 
tors of Bayesian models using non-parametric importance sampling ("arrogance sam- 
pling"). This method can also be used to compute the normalizing constant of probabil- 
ity distributions. Because the required inputs are samples from the distribution to be 
normalized and the scaled density at those samples, this method may be a convenient 
replacement for the harmonic mean estimator. The method has been implemented in 
the open source R package margLikArrogance. 



1 Introduction 

When a Bayesian evaluates two competing models or theories, Ti and T2, having observed 
a vector of observations Bayes' Theorem determines the posterior ratio of the models' 
probabilities: 

p{T^\x) ^ p{x\T,)p{T,) 

p{T2\x) p{x\T2)p{T2)- ^ ^ 

The quantity is called a Bayes factor and the quantities p{x\Ti) and p{x\T2) are called 

the theories' marginal likelihoods. 

The types of Bayesian models considered in this paper have a fixed finite number of 
parameters, each with their own probability function. If are parameters for a model T, 
then 

p{x\T) = j p{x\0,T)p{e\T)d0 = jp{xAe\T)dO (2) 

Unfortunately, this integral is difficult to compute in practice. The purpose of this paper is 
to describe one method for estimating it. 

Evaluating integral ^ is sometimes called the problem of computing normalizing con- 
stants. The following formula shows how p{x\T) is a normalizing constant. 

Thus the marginal likelihood p{x\T) is also the normalizing constant of the posterior pa- 
rameter distribution p{0\x,T) assuming we are given the density p{6 A x\T) which is often 
easy to compute in Bayesian models. Furthermore, Bayesian statisticians typically produce 
samples from the posterior parameter distribution p{0\x,T) even when not concerned with 
theory choice. In these case, computing the marginal likelihood is equivalent to computing 
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the normalizing constant of a distribution from which samples and the scaled density at these 
samples are available. The method described in this paper takes this approach. 

2 Review of Literature 

Given how basic ([T]) is, it is perhaps surprising that there is no easy and definitive way of 
applying it, even for simple models. Furthermore, as the dimensionality and complexity 
of probability distributions increase, the difficulty of approximation also increases. The 
following three techniques for computing bayes factors or marginal likelihoods are important 
but will not be mentioned further here. 

1. Analytic asymptotic approximations such as Laplace's method, see for instance Kass 
and Raftery (1995), 

2. Bridge sampling/path sampling/thermodynamic integration (Gelman and Meng, 1998), 



3. Chib's MCMC approximation (Chib, 1995; Chib and Jehazkov, 2005). 

Kass and Raftery (1995) is a popular overview of the earlier literature on Bayes factor 
computation. All these methods can be very successful in the right circumstances, and can 
often handle problems too complex for the method described here. However, the method of 
this paper may still be useful due to its convenience. 

The rest of section [2] describes three approaches that are relevant to this paper. 

2.1 Importance Sampling 

Importance sampling is a technique for reducing the variance of monte carlo integration. 
This section will note some general facts; see Owen and Zhou (1998) for more information. 

Suppose we are trying to compute the (possibly multidimensional) integral J of a well- 
behaved function f{0). Then 



so if g{6) is a probability density function and 0i are independent samples from it, then 



In is an unbiased approximation to / and by the central limit theorem will tend to a normal 
distribution. It has variance 



and 





n 
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dO 




dO 



(5) 



Sometimes / is called the target and g is called the proposal distribution. 

Assuming that / is non-negative, then minimum variance (of 0!) is achieved when g = 
f /I — in other words when g is just the normalized version of /. This cannot be done in 
practice because normalizing / requires knowing the quantity / that we wanted to approxi- 
mate; however ([S]) is still important because it means that the more similar the proposal is 
to the target, the better our estimator /„ becomes. In particular, / must go to faster than 
g or the estimator will have infinite variance. 

To summarize this section: 

1. Importance sampling is a monte carlo integration technique which evaluates the target 
using samples from a proposal distribution. 

2. The estimator is unbiased, normally distributed, and its variance (if not or infinity) 
decreases as 0{n~^) (using big-0 notation). 

3. The closer the proposal is to the target, the better the estimator. The proposal also 
needs to have longer tails than the target. 

2.2 Nonparametric Importance Sampling 

A difficulty with importance sampling is that it is often difficult to choose a proposal dis- 
tribution g. Not enough is known about / to choose an optimal distribution, and if a bad 
distribution is chosen the result can have large or even infinite variance. One approach to 
the selection of proposal g is to use non-parametric techniques to build g from samples of /. 
I call this class of techniques self-importance sampling, or arrogance sampling for short, 
because they attempt to sample / from itself without using any external information. (And 
also isn't it a bit arrogant to try to evaluate a complex, multidimensional integral using only 
the values at a few points?) The method of this paper falls into this class and particularly 
deserves the name because the target and proposal (when they are both non-zero) have 
exactly the same values up to a multiplicative constant. 

Two papers which apply nonparametric importance sampling to the problem of marginal 
likelihood computation (or computation of normalizing constants) are Zhang (1996) and 
Neddermeyer (2009). Although both authors apply their methods to more general situations, 
here I will use the framework suggested by ^ and assume that we can compute p{0Ax\T) for 
arbitrary and also that we can sample from the posterior parameter distribution p{6\x, T). 
The goal is to estimate the normalizing constant, the marginal likelihood p{x\T). 

Zhang's approach is to build the proposal g using traditional kernel density estimation, 
m samples are first drawn from p{0\x, T) and used to construct g. Then n samples are drawn 
from g and used to evaluate p{x\T) as in traditional importance sampling. This approach 
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is quite intuitive because kernel estimation is a popular way of approximating an unknown 
function. Zhang proves that the variance of his estimator decreases as 0{mi+dn~^) where 
d is the dimensionality of 0, compared to 0{n~^) for standard (parametric) importance 
sampling. 

There were, however, a few issues with Zhang's method: 

1. A kernel density estimate is equal to at points far from the points the kernel estimator 
was built on. This is a problem because importance sampling requires the proposal 
to have longer tails than the target. This fact forces Zhang to make the restrictive 
assumption that p{0\x,T) has compact support. 

2. It is hard to compute the optimal kernel bandwidth. Zhang recommends using a 
plug-in estimator because the function p{6 A x\T) is available, which is unusual for 
kernel estimation problems. Still, bandwidth selection appears to require significant 
additional analysis. 

— 4 

3. Finally, although the variance may decrease as 0{m'^n^^) as m increases, the diffi- 
culty of computing g{6) also increases with m, because it requires searching through 
the m basis points to find all the points close to 0. In multiple dimensions, this prob- 

-4 

lem is not trivial and may outweigh the 0(m4+3) speedup (in the worst case, practical 
evaluation of g{0) at a single point may be 0{m)). See Zlochin and Baram (2002) for 
some discussion of these issues. 

Neddermeyer (2009) uses a similar approach to Zhang and also achieves a variance of 

— 4 

0{mWn~^). It improves on Zhang's approach in two ways relevant to this paper: 

1. The support of p{6\x,T) is not required to be compact. 

2. Instead of using kernel density estimators, linear blend frequency polynomials (LBFPs) 
are used instead. LBFPs are basically histograms whose density is interpolated between 
adjacent bins. As a result, the computation of g{0) requires only finding which bin 
is in, and looking up the histogram value at that and adjacent bins (2*^ bins in total). 

As we will see in section [3} the arrogance sampling described in this paper is similar to 
the methods of Zhang and Neddermeyer. 

2.3 Harmonic Mean Estimator 

The harmonic mean estimator is a simple and notorious method for calculating marginal 
likelihoods. It is a kind of importance sampling, except the proposal g is actually the 
distribution p{0\x, T) = p{0 A x\T)/p{x\T) to be normalized and the target / is the known 
distribution p{0\T). Then if 0i are samples from p{0\x,T), we apparently have 
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n 

n ^ 



i=l 



p{0i\x,T) 




1=1 



p{x\e,,T)pid,\T)/p{x\T) 



pmT) 



n 



1 



n 



p{x\d„T)/p{x\T) 



1 



hence 




p{x\ei,T) 



1 



) 



(6) 



Two advantages of the harmonic mean estimator are that it is simple to compute and 
only depends on samples from p{0\x,T) and the likelihood p{x\6,T) at those samples. The 
main drawback of the harmonic mean estimator is that it doesn't work — as mentioned earlier 
the importance sampling proposal distribution needs to have longer tails than the target. In 
this case the target p{0\T) typically has longer tails than the proposal p{0\x,T) and thus 
(|6| has infinite variance. Despite not working, the harmonic mean estimator continues to be 
popular (Neal, 2008). 

3 Description of Technique 

This paper's arrogance sampling technique is a simple method that applies the nonparametric 
importance techniques of Zhang and Neddermeyer in an attempt to develop a method almost 
as convenient as the harmonic mean estimator. 

The only required inputs are samples Oi from p{6\x,T) and the values p{Oi A x\T) = 
p{x\0i,T)p{6i\T). This is similar to the harmonic mean estimator, but perhaps slightly less 
convenient because p{6i A x\T) is required instead of p{x\6i,T). 

There are two basic steps: 

1. Take m samples from p{6\x, T) and using modified histogram density estimation, con- 
struct probability density function f{6). 

2. With n more samples from p{6\x, T), estimate 1 /p{x\T) via importance sampling with 
target / and proposal p{6\x,T). 

These steps are described in more detail below. 
3.1 Construction of the Histogram 

Of the N total samples 6i from p{0\x^ T), the first m will be used to make a histogram. The 
optimal choice of m will be discussed below, but in practice this seems difficult to determine. 
An arbitrary rule of min(0.2A^, 2\/N) can be used in practice. 



5 



3.2 Importance Sampling 



3 DESCRIPTION OF TECHNIQUE 



With a traditional histogram, the only available information is the location of the sampled 
points. In this case we also know the (scaled) heights p{6 A x\T) at each sampled point. We 
can use this extra information to improve the fit. 

Our "arrogant" histogram / is constructed the same as a regular histogram, except the bin 
heights are not determined by the number of points in each bin, but rather by the minimum 
density over all points in the bin. If a bin contains no sampled points, then f{6) = for 
in that bin. Then / is normalized so that J f{6) dO = 1. 

To determine our bin width, we can simply and somewhat arbitrarily set our bin width 
h so that the histogram is positive for 50% of the sampled points from the distribution 
p{0\x, T). To approximate h, we can use a small number of samples (say, 40) from p{0\x, T) 
and set h so that f{6) > for exactly half of these samples. 

Figure [T] compares the traditional and new histograms for a one dimensional normal 
distribution based on 50 samples. The green rug lines indicate the 50 sampled points which 
are the same for all. The arrogant histogram's bin width is chosen as above. The traditional 
histogram's optimal bin width was determined by Scott's rule to minimize mean squared 
error. As the figure shows, the modified histogram is much smoother for a given bin width, 
so a smaller bin width can be used. On the other hand, / will either equal or have 
about twice the original density at each point, while the traditional histogram's density is 
numerically close to the original density. 



3.2 Importance Sampling 

The remaining n = N — m — 40 sampled points can be used for importance sampling. Using 
equation (jij) with histogram / as our target and p{0\x,T) as the proposal, we have 

" n^p{ei\x,T) n ^ p{e, Ax\T)/p{x\T) 

hence 



p(.|T)«p(.|r)//„=(i|:^;^^^) =^„ (7) 

To underscore the self-important/arrogant nature of this approximation An, we can rewrite 
(0) as 

, f ^ v"^ mm\p(0-i A x\T) : 0j and Oj are in the same bin|\ 

pi-m - (- E pf^;^^^^^ j 

where H is the histogram normalizing constant. This equation shows that all the values in the 
numerator and the denominator of our importance sampling are from the same distribution 
p{e Ax\T). 
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4 VALIDITY OF METHOD 



Note that the histogram / is the target of the importance samphng and p{0 A x\T) is the 
proposal. This is backwards from the usual scheme where the unknown distribution is the 
target and the known distribution is the proposal. Instead here the unknown distribution is 
the proposal, as in the harmonic mean estimator (see Robert and Wraith (2009) for another 
example of this.) 

As in section 2.1 , our approximation of p{x\T)~^ tends to a normal distribution as n — )■ oo 
by the central limit theorem. This fact can be used to estimate a confidence interval around 
p{x\T). 



4 Validity of Method 

This section will investigate the performance of the method. First, note that this method is 
just an implementation of importance sampling, so should converge to p{x\T)~^ with 
finite variance as long as the proposal density p{6\x,T) exists and is finite and positive on 
the compact region where the target histogram density is positive. 

To calculate the speed of convergence we will use equation (|5]) where / is the histogram, 
g{6) = p{6\x,T), and 1 = 1 because the histogram has been normalized. Unless otherwise 
noted, we will assume below that (7 : M'^ — t- M is finite, twice differentiable and positive, and 

that f M^t/6» is finite. 

J 0(6) 



4.1 Histogram Bin Width 

One important issue will be how quickly the d-dimensional histogram's selected bin width 
h goes to as the number of samples m — )■ 00. This section will only offer an intuitive 
argument. For any m, the histogram will enclose about the same probability (^) and will 
have about the same average density in a fixed region. Each bin has volume h'^, so if I is the 
number of bins then Ih'^ = 0(1) and h oc 

Furthermore, the distribution of the sampled points converges to the actual distribution 
g{0). If m > 0(/), an unbounded number of sampled points would end up in each bin. If 
m < 0{l), then some bins would have no points in them. Neither of these is possible because 
exactly one sampled point is necessary to establish each bin. Thus m oc / and h oc m~'^. 



4.2 Conditional Variance 

Before estimating the convergence rate of An we will prove something about the conditional 
variance of importance sampling. Let A = {6 : f{6) > 0}, 1a be the characteristic function 
of A, and q = J^g{6) dd. Define 

n (ft^ = l 9i0)/q if ^ e A 
^^^^ \ otherwise 
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Then is the density of g conditional on / > 0. Define Var^ and to mean the variance 
and expectation conditional on f{6) > 0. Thus 



V^r{f{e)/giO)) = Vs.T{EifiO)/g{0)\U)) + EiVaT{f{0)/giO)\U)) 

EA{f{d)/g{6)) if 6eA 
otherwise 

VarAifie)/g{e)) ii e A 
otherwise 



Var 
+ E 



- ^-(? otherwte)+^^--^^^^)/^^^)) 
= {l/q)'q{l -q) + -VaTA{f{0)/qg{0)) 

q 

= ^ + -Var^(/(0)/(?^(0)) 

q q 

We will assume below that g = ^, so that 

YaT{f{e)/g{e)) = 1 + 2VaTA{f{e)/gA{e)) (8) 

4.3 Importance Sampling Convergence 

With /, g, and A as defined above, / and gA have the same domain. Assuming errors in 
estimating q and normalization errors are of a lesser order of magnitude, we can treat the 
histogram heights as being sampled from gA- Suppose the histogram has / bins {Bj}, each 
with width h and based around the points gA{Oj). Then by equation ([s]). 



var.(/(e)/j,(e)) = 5: / 7f d0 



[ ^9A{e) + VgAje) ■ [6, -6) + am - o?) - aAm^ ^ 

j=i " ' 



9A{e) 

{VgA{e)-{e,-e)f + o{{e,-ef) 



^ 9a{0) 



< 



j=i - 

9a[0) 



3=1 " 
U2 
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5 IMPLEMENTATION ISSUES 



Because h (xm where d is the number of dimensions, and m is the number of samples 
used to make the histogram. 



YaiA{f{e)/gA{e)) < Cm-'^l^ 
Vaiiln) = Vaiipix\T)/An) = n-\l + 0{Cm-^/'')) (9) 



where C cc J ^-^^-^^S^^dO. Putting this together with (8), we get 



5 Implementation Issues 

5.1 Speed of Convergence 

The variance of n^^(l + 0(Cm^^/^)) given by (joj) is asymptotically equal to n~^, which is 
the typical importance sampling rate. In practice however, the asymptotic results cannot 
distinguish useful from impractical estimators. If Cm~'^^^ is small and Vai {p(x\T) / An) ~ 
n~^, then p{x\T) can be approximated in only 1000 samples to about 6% = -^^^ with 95% 
confidence. For many theory choice purposes, this is quite sufficient. Thus in typical problem 
cases the factor of Cm"^/*^ will be very significant. If Cm~^^'^ ^ 1, then the convergence 
rate may in practice be similar to n^^m''^^'^. Compare this to the rate of n~^m~'^^^'^~^'^^ for 
the methods proposed by Zhang and Neddermeyer. 

This method also uses simple histograms, instead of a more sophisticated density es- 
timation method (Zhang uses kernel estimation, Neddermeyer uses linear blend frequency 
polynomials). Although simple histograms converge slower for large d as shown above, they 
are much faster to compute for large d. 

Neddermeyer's LBFP algorithm is quite efficient compared to Zhang's, but its running 
time is 0{2 d 71'^+'^). d is a constant for any fixed problem, but if, say, d = 10, then the 
dimensionality constant multiplies the running time by 2^°10^ ~ 10^. 

By contrast, this paper's method takes only 0{dmlog{m)) time to construct the initial 
histogram, and an additional 0{dnlog{m)) time to do the importance sampling. The main 
reason for the difference is that querying a simple histogram can be done in log(m) time by 
computing the bin coordinates and looking up the bin's height in a tree structure. However, 
querying a LBFP requires blending all nearby bins and is thus exponential in d. 

5.2 When g = 

Our discussion assumed that g{6) = p{6\x,T) was always positive. If g goes to where the 
histogram is positive, the variance of A~^ will be infinite. However, this paper's method can 
still be used if g{6) is over some well-defined area. 
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For instance, suppose one dimension 6k of p{6\T) is defined by a gamma distribution, so 
that p{6k\T) = if and only if 6k < 0. Then we can ensure the variance is not infinite by 
checking that the histogram is only defined where > e > for some fixed e. 

The margLikArrogance package contains a simple mechanism to do this. The user may 
specify a range along each dimension of 6 where it is known that g > 0. If the histogram is 
non-zero outside of this range, the method aborts with an error. 

Note that the variance of the estimator increases with J ^^^j^^^^dO. In practice the 
estimator will work well only when g doesn't go to too quickly where the histogram is 
positive. In these cases the histogram will be defined well away from any region where g = 
and infinite variance won't be an issue even if = somewhere. 

5.3 Bin Shape 

Cubic histogram bins were used above — their widths were fixed at h in each dimension. Al- 
though the asymptotic results aren't affected by the shape of each bin, for usable convergence 
rates the bins' dimensions need to compatible with the shape of the high probability region 
oi p{6\x,T). Unfortunately, it is difficult to determine the best bin shapes. 

The margLikArrogance package contains a simple workaround: by default the distribu- 
tion is first scaled so that the sampled standard deviation along each dimension is constant. 
This is equivalent to setting each bin's width by dimension in proportion to that dimension's 
standard deviation. If this simple rule of thumb is insufficient, the user can scale the sampled 
values of p{6\x,T) manually (and make the corresponding adjustment to the estimate An). 

6 Conclusion 

This paper has described an "arrogance sampling" technique for computing the marginal 
likelihood or Bayes factor of a Bayesian model. It involves using samples from the model's 
posterior parameter distribution along with the scaled values of the distribution's density at 
those points. These samples are divided into two main groups: m samples are used to build 
a histogram; n are used to importance sample the histogram using the posterior parameter 
distribution as the proposal. 

This method is simple to implement and runs quickly in 0{d{m + ?T,)log(m)) time. Its 
asymptotic convergence rate, n^^(l + 0{Cm^'^^^)), is not remarkable, but in practice con- 
vergence is fast for many problems. Because the required inputs are similar to those of the 
harmonic mean estimator, it may be a convenient replacement for it. 
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